protect-your-child-fake-news.png

Introduction

Fake news is a broad term; in this project our focus is on covid-19 tweets and online articles, and on whether the information presented to the reader is fake or real. In this notebook, we will process the covid-19 tweets dataset and the online articles dataset, leveraging the knowledge we derived from the exploratory data analysis notebooks to fix the issues we found. Next, we will visualize the data after fixing it and summarize the findings. After performing this procedure on both datasets, we will merge them and briefly conclude.

Problem understanding

News comes in many different forms: newspapers, magazines, TV and radio, the internet, news agencies, and alternative media. Some businesses, in order to promote or advertise a product, publish fake, attractive, or charming news as 'clickbait', and since we tend to be curious about unusual things, we read or check the content. For instance, in 1835 a newspaper claimed that astronomers had discovered creatures living on the moon (humans with wings); here is the link to the full article. As a consequence, the paper became more popular, which allowed it to promote and advertise more. This is one of the reasons why a person or group of people would publish fake news or misleading information. However, the influence on the reader is usually harmful. For example, fake news about healthy food might suggest or recommend certain types of food to people who suffer from some kind of illness, and if readers followed the suggestions it could in fact harm them, since the news was fake in the first place. There are obviously more scenarios, but this is where the problem lies: fake news is mostly published for the sake of profit, and readers may believe misleading information, which could influence their decisions.

In this research, though, the focus will be on covid-19 news (tweets) and on fake and legitimate articles on various topics published on the internet.

Expectations of this notebook

After reading this notebook you should have:

Table of contents


Loading libraries and dataset

Explanation of the loaded libraries

Matplotlib Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

Seaborn Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Numpy It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Pandas pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

re A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression.

nltk NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

spacy spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.

langdetect A python library that allows you to identify the language in a given string.

wordcloud Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance.

CountVectorizer CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

String The Python string module contains some constants, utility functions, and classes for string manipulation.

Preparing the covid-19 tweets dataset

In this section of the notebook, I will process the covid-19 tweets dataset using the following procedure:

  1. Getting rid of unnecessary columns.
  2. Checking whitespaces in columns.
  3. Checking missing values.
  4. Checking duplicated values.
  5. Applying pre-processing to fix issues found in EDA notebook.
  6. Comparing the results of important words before and after the pre-processing.
  7. Prepare the dataset to be merged.

At the end of this section, the tweets dataset should be ready to be merged with the online article dataset.

Getting rid of unnecessary columns

We found out in the covid-19 tweets EDA notebook that id is a unique identifier for each tweet, which would not help us predict whether or not a tweet is reliable. Therefore, I will be dropping that column from the dataset.
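As a minimal sketch, the toy frame below stands in for the real dataset (the column names 'id', 'tweet', and 'label' follow the notebook's description):

```python
import pandas as pd

# Toy stand-in for the covid-19 tweets dataset; the real one has the same
# 'id', 'tweet', and 'label' columns described in the EDA notebook.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "tweet": ["new cases reported", "garlic cures covid", "stay home"],
    "label": ["real", "fake", "real"],
})

# The unique identifier carries no predictive signal, so drop it.
df = df.drop(columns=["id"])
print(df.columns.tolist())  # -> ['tweet', 'label']
```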

Findings

The dataset no longer has the unique 'id' identifier and contains only useful columns, which we will use to predict reliability.

Checking whitespace in column names

Findings

The covid-19 tweets dataset contains tweet and label columns, neither of which has whitespace in its column name. This will help us later when merging with the online articles dataset.

Checking missing values in the dataset

Findings

The covid-19 tweets dataset does not have any missing values. That is great, because now we do not have to worry about handling missing data.
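The check itself can be sketched on a hypothetical mini-frame like this (on the real dataset the same call is run over every column):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the tweets dataset.
df = pd.DataFrame({
    "tweet": ["new cases reported", "stay home"],
    "label": ["real", "real"],
})

# Count missing values per column; every entry should be zero here.
missing_per_column = df.isnull().sum()
print(missing_per_column.to_dict())  # -> {'tweet': 0, 'label': 0}
```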

Checking duplicated values in the dataset

Findings

The covid-19 tweets dataset does not have any duplicated values.
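To illustrate what the duplicate check reports, here is a toy frame that deliberately contains one exact duplicate row (the real tweets dataset reports zero):

```python
import pandas as pd

# Toy frame with one exact duplicate row, purely for illustration.
df = pd.DataFrame({
    "tweet": ["stay home", "masks work", "stay home"],
    "label": ["real", "real", "real"],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)  # -> 1
```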

Pre-processing: fixing issues found in the EDA notebook

In this sub-section I will use the notes we derived from the covid-19 tweets exploratory data analysis notebook in order to fix the text issues.

In the conclusion of the covid-19 tweets EDA notebook it was mentioned that the following actions would solve the noticed problems:

  1. Lowercase all the text
  2. Expand Contractions
  3. Remove hyperlinks
  4. Remove hashtags
  5. Remove punctuations
  6. Remove emojis
  7. Remove digits

The following note is taken from the EDA notebook as well:

After that we should have clean text, but in order to continue processing it we need to remove stopwords, since they appear often in the corpus without carrying much meaning, and to apply a lemmatization technique to bring words in their various tense forms back to their base form (e.g. 'went' --lemma--> 'go').

Lowercase all the text

In NLP, models treat words like Goat and goat differently, even though they are the same word. To overcome this problem, we lowercase all the text. Here, I am using the lower() function available in Python to convert text to lowercase.
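On a pandas text column this is a one-liner; a small sketch with toy tweets:

```python
import pandas as pd

# Toy text column; 'tweet' is the text column used throughout this notebook.
tweets = pd.Series(["COVID-19 Cases RISE Again", "Stay HOME"])

# Vectorized lowercasing of every value in the column.
lowered = tweets.str.lower()
print(lowered.tolist())  # -> ['covid-19 cases rise again', 'stay home']
```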

Findings

The text data has all been lowered successfully.

Expand Contractions

Contractions are the shortened versions of words, like don't for do not and how'll for how will. They are used to reduce the speaking and writing time of words. We need to expand these contractions for a better analysis of the text.

The following dictionary is taken from this link.
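A minimal sketch of dictionary-based expansion; only a few sample entries are shown here, whereas the notebook uses the much larger dictionary from the linked source:

```python
import re

# Tiny sample of the contractions dictionary; the real dictionary taken
# from the linked source is far larger.
contractions = {"don't": "do not", "can't": "cannot", "how'll": "how will"}

# Build one alternation pattern over all keys, with word boundaries.
pattern = re.compile(r"\b(" + "|".join(re.escape(c) for c in contractions) + r")\b")

def expand_contractions(text):
    # Replace each matched contraction with its expanded form.
    return pattern.sub(lambda m: contractions[m.group(0)], text)

print(expand_contractions("i don't think masks can't help"))
# -> 'i do not think masks cannot help'
```

Note that this runs after lowercasing, so only lowercase keys are needed.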

Findings

If you compare review 9 from the lowercasing sub-section, you will notice that it started with 'i don't', which has now been expanded to 'i do not'. Therefore, we can be confident that the contractions have been expanded properly.

Remove hyperlinks

I used the following link to get the code for removing hyperlinks from raw text: link
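A simplified sketch of such a pattern (the exact regex from the linked source may differ):

```python
import re

def remove_hyperlinks(text):
    # Drop http(s) URLs and bare www links; a simplified version of the
    # kind of pattern found at the linked source.
    return re.sub(r"https?://\S+|www\.\S+", "", text)

print(remove_hyperlinks("case update https://t.co/abc123 stay safe"))
# -> 'case update  stay safe'
```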

Findings

All the hyperlinks have been removed; the sample tweets showed some hyperlinks in previous sub-sections but no longer do.

Remove hashtags

Using a regular expression, I can select all the tokens that start with # or @ and remove them from the text.
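A minimal sketch of that regular expression (covering both hashtags and @-mentions, as described above):

```python
import re

def remove_tags(text):
    # Remove tokens beginning with '#' (hashtags) or '@' (mentions).
    return re.sub(r"[#@]\w+", "", text)

print(remove_tags("new cases today #covid19 @who"))
# -> 'new cases today  '
```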

Findings

The hashtags have been successfully removed. If you read review 6 in the previous sub-sections and above, you will notice that #covid19 has been removed; the same logic has been applied to all the tweets, which means that at this point there are no hashtags left.

Remove punctuations

Findings

We can notice that there are no dots, commas, or any other punctuation marks displayed in the text.
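One common way to do this, sketched here as an assumption about the implementation, is a translation table built from Python's string.punctuation constant:

```python
import string

def remove_punctuation(text):
    # string.punctuation covers !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ;
    # translate() deletes every one of those characters.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("cases rise, again?!"))  # -> 'cases rise again'
```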

Remove emojis

The following link helped me apply a method that removes emojis from raw text: link
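A sketch of the emoji-stripping idea; the unicode ranges below are an assumption covering the most common emoji blocks, while the notebook takes its actual pattern from the linked source:

```python
import re

# Assumed unicode ranges covering the most common emoji blocks; the
# notebook's real pattern comes from the linked source.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs (includes 👇)
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U00002700-\U000027BF"  # dingbats
    "]+"
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("read the thread 👇"))  # -> 'read the thread '
```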

Findings

Notice that review 14 used to contain the 👇 emoji; it has now been removed, which indicates that there are no longer any emojis in the text.

Remove digits

This note is taken from the covid-19 tweets EDA notebook:

Note: Keep in mind that the model that is going to be created should give results according to the keywords of the tweets; thus digits should not matter, as covid-19 figures change and there is no way for the model I am developing to distinguish 'fake' or 'real' tweets based on digits.
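Removing digits can be sketched with one regular expression substitution:

```python
import re

def remove_digits(text):
    # Strip every run of digits from the text.
    return re.sub(r"\d+", "", text)

print(remove_digits("1200 new cases in week 42"))  # -> ' new cases in week '
```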

Findings

We can notice that there are now no digits at all in any of the tweets. At this point the text is clean.

Removing stopwords and applying lemmatization

The previous sub-section cleaned the data, but it still contains words such as 'the', 'is', and 'are' that do not add much meaning to the overall corpus yet appear often, so we need to remove them in order to derive useful information from the word cloud visualization. Words also appear in different tense forms (past or present), and by performing lemmatization each word is reduced to its base form while considering its surrounding context.
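A toy sketch of both steps; the notebook itself relies on NLTK's stopword corpus and an NLTK/spaCy lemmatizer (which need downloaded language data), so the stopword set and lemma map below are purely illustrative stand-ins:

```python
# Toy stand-ins: the notebook uses NLTK's stopword list and a real
# lemmatizer; these small structures only illustrate the idea.
STOPWORDS = {"the", "is", "are", "a", "an", "and", "in"}
LEMMA_MAP = {"went": "go", "reported": "report", "cases": "case"}

def prepare(text):
    tokens = [t for t in text.split() if t not in STOPWORDS]  # drop stopwords
    return " ".join(LEMMA_MAP.get(t, t) for t in tokens)      # map to base forms

print(prepare("the officials reported a rise in cases"))
# -> 'officials report rise case'
```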

Findings

After the text has been cleaned, stop words and the variety of tense forms can still affect the results of the model to be developed. For example, the word 'report' is treated differently from its past form 'reported'. Therefore, the previous step removed stop words, since they usually do not add meaning, and reduced words that mean the same thing to the same base form, saving the result in a column called 'prepared'.

Visualize important words after fixing the text issues

Group all fake tweets together and all real tweets together.

Create a function for generating word clouds.

Create a matrix that indicates the importance of each word occurring in the corpus, corresponding to reliability.

Display the word cloud

Below are the results of EDA notebook (before the pre-processing)

Screenshot%20%28273%29.png

Screenshot%20%28274%29.png

Findings

We can notice that there are no longer any numbers in the word clouds, and 'http', which was previously emphasised, has disappeared. In addition, words such as 'report', 'new', and 'cases' are strong keywords for real tweets, while words such as 'claim', 'cure', and 'lockdown' are emphasized and are stronger keywords for fake tweets than in the version shown in the EDA notebook.

Prepare the covid 19 tweets to be merged

We will be using the prepared version of the text, since the model will benefit from learning on good-quality data.

The following code will add the value 'tweet' to every row since we will need the type column later on in the preparation for prediction part.

Findings

We have created a subset of the covid-19 tweets which is ready to be merged with the online articles dataset via the union merging technique.

Preparing the online articles dataset

In this section of the notebook, I will process the online articles dataset using the following procedure:

  1. Getting rid of unnecessary columns.
  2. Checking whitespaces in columns.
  3. Checking missing values.
  4. Checking duplicated values.
  5. Applying pre-processing to fix issues found in EDA notebook.
  6. Comparing the results of important words before and after the pre-processing.
  7. Prepare the dataset to be merged.

At the end of this section, the online articles dataset should be ready to be merged with the covid-19 tweets, which have already been prepared.

Getting rid of unnecessary columns

We found out in the online articles EDA notebook that 'Unnamed: 0' is a unique identifier for each article, which would not help us predict whether or not an article is reliable. Therefore, I will be dropping that column from the dataset. In addition, since we use the text (the content) to determine reliability, the title column will not be used for now; in a future version we could leverage it, for instance for topic modelling. Therefore, it will also be dropped.

Checking whitespace in column names

Findings

The dataset's column names do not contain whitespace, which will allow us to merge the datasets easily later on.

Checking missing values in the dataset

Findings

The dataset does not have any missing values, which is good; we do not have to worry about handling missing data in the online articles dataset.

Checking duplicated values in the dataset

Findings

The dataset does contain duplicated values. I noticed that some articles have empty text and some lean towards spam or advertisement, which makes it plausible for them to be duplicated. Dividing the number of duplicated articles by the total number of articles shows that duplicates make up about 5% (344 / 6335) of the whole dataset. All these factors combined led me to decide to drop these articles from the dataset.
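The arithmetic and the drop can be sketched on a toy frame; the real-dataset figures (344 duplicates out of 6335 rows) come from the text above:

```python
import pandas as pd

# Toy articles frame with one exact duplicate; on the real dataset the
# same arithmetic gives 344 / 6335, roughly 5%.
df = pd.DataFrame({
    "text": ["buy now limited offer", "senate passes bill", "buy now limited offer"],
    "label": ["FAKE", "REAL", "FAKE"],
})

n_dup = int(df.duplicated().sum())
print(n_dup, round(344 / 6335, 3))  # duplicate count, real-dataset share

# Drop the repeats and renumber the rows.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # -> 2
```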

Pre-processing: fixing issues found in the EDA notebook

In this sub-section I will use the notes we derived from the online articles exploratory data analysis notebook in order to fix the text issues.

In the conclusion of the online articles EDA notebook it was mentioned that the following actions would solve the noticed problems:

  1. Lowercase all the text
  2. Expand Contractions
  3. Remove punctuations
  4. Remove emojis
  5. Remove digits

The following note is taken from the EDA notebook as well:

After that we should have clean text, but in order to continue processing it we need to remove stopwords, since they appear often in the corpus without carrying much meaning, and to apply a lemmatization technique to bring words in their various tense forms back to their base form (e.g. 'went' --lemma--> 'go').

Lowercase all the text

In NLP, models treat words like Goat and goat differently, even though they are the same word. To overcome this problem, we lowercase all the text. Here, I am using the lower() function available in Python to convert text to lowercase.

Findings

We can notice that all the words in the articles have been lowercased.

Expand Contractions

Contractions are the shortened versions of words, like don't for do not and how'll for how will. They are used to reduce the speaking and writing time of words. We need to expand these contractions for a better analysis of the text.

The following dictionary is taken from this link.

Remove punctuations

Remove emojis

The following link helped me apply a method that removes emojis from raw text: link

Remove digits

This note, taken from the online articles EDA notebook, indicates that digits should be removed:

Note: Keep in mind that the model that is going to be created should give results according to the keywords of the articles; thus digits should not matter. Distinguishing 'fake' or 'real' articles based on keywords is possible, but not with digits included (at least in my project).

Findings

The dataset text has been cleaned using the same procedure used to pre-process the covid-19 dataset, except that the number of actions is slightly lower, since articles rarely contain hyperlinks or hashtags in their body text.

The dataset text has been cleaned successfully and is ready for the stage where stop words are removed and the rest of the corpus is lemmatized.

Removing stopwords and applying lemmatization

The previous sub-section cleaned the data, but it still contains words such as 'the', 'is', and 'are' that do not add much meaning to the overall corpus yet appear often, so we need to remove them in order to derive useful information from the word cloud visualization. Words also appear in different tense forms (past or present), and by performing lemmatization each word is reduced to its base form while considering its surrounding context.

Findings

After the text has been cleaned, stop words and the variety of tense forms can still affect the results of the model to be developed. For example, the word 'report' is treated differently from its past form 'reported'. Therefore, the previous step removed stop words, since they usually do not add meaning, and reduced words that mean the same thing to the same base form, saving the result in a column called 'prepared'.

Visualizing important words after fixing the text issues

Group all fake online article words together and all real online article words together.

Create a function that creates word clouds.

Create a matrix in which the importance of each word is indicated, corresponding to reliability.

Display the word cloud

Below are the results from the EDA notebook (before the pre-processing).

Screenshot%20%28275%29.png

Screenshot%20%28276%29.png

Findings

We can notice that there are no longer any numbers in the word clouds. In addition, words such as 'said' have been reduced to their base forms ('say', 'go') as keywords for real online articles. The base form of the word 'republican' became significantly stronger among the important words for real online articles, while some words, such as 'clinton', dominate both before and after processing for both fake and real articles. Lastly, 'trump' became significantly less important as an indicator of a real article.

Prepare the online articles to be merged

We will be using the prepared version of the text, since the model will benefit from learning on good-quality data.

The following code will add the value 'article' to every row since we will need the type column later on in the preparation for prediction part.

In order to be consistent in the label column, I need to transform the values from capital to lowercase letters.

Findings

We have created a subset of the online articles. We also transformed the values of the target variable from capital to lowercase letters to be consistent with the covid-19 tweets dataset. The data subset is ready to be merged with the covid-19 tweets dataset via the union merging technique.

Merging the datasets

In this section I will review the datasets and clarify with an image which technique I will be using for merging.

Both prepared datasets have the same data characteristics, and therefore the union technique is suitable for merging.

Screenshot%20%28277%29.png

We can use the pd.concat method available in pandas in order to merge the datasets in a union fashion along the rows axis (which is the default parameter).

Notice that the indexes are wrong, so let's fix them.
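Both steps can be sketched on toy stand-ins for the two prepared subsets (the column names 'prepared', 'label', and 'type' follow the preparation sections above):

```python
import pandas as pd

# Toy stand-ins for the two prepared subsets; both share the same three
# columns, as required for a union-style merge.
tweets = pd.DataFrame({"prepared": ["case rise", "cure claim"],
                       "label": ["real", "fake"],
                       "type": ["tweet", "tweet"]})
articles = pd.DataFrame({"prepared": ["senate pass bill"],
                         "label": ["real"],
                         "type": ["article"]})

merged = pd.concat([tweets, articles])   # union along the rows axis (default)
merged = merged.reset_index(drop=True)   # repair the duplicated indexes
print(len(merged), merged.index.tolist())  # -> 3 [0, 1, 2]
```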

Conclusion

The merged dataset gives logical results: the covid-19 tweets dataset had 6420 rows and the online articles dataset had 5991 rows, both with 3 features. Therefore the combined dataset should have 6420 + 5991 = 12411 rows and the same features (the same feature names and a consistent target variable). The indexes of the merged dataset were off and have been successfully fixed.

Saving the file

Conclusion

To conclude: after the data was explored and issues were found in the text, this notebook worked through whitespace in column names, missing values, and duplicated data in both the covid-19 tweets and online articles datasets. In addition, the text data problems were resolved, and the important words for reliability in both the covid-19 tweets and the online articles were visualized and compared before and after the pre-processing procedure. After preparing the datasets and making sure that both have the same characteristics, the union merging approach was suitable for combining them, and the merging operation gave logical results. Finally, the combined cleaned dataset was saved to a CSV file for the preparation (modelling) phase.